
feat: U8x64 byte-level ops for palette codec, nibble, byte scan (Pumpkin/SD) #76

Merged
AdaWorldAPI merged 3 commits into master from claude/setup-embedding-pipeline-Fa65C on Apr 3, 2026


@AdaWorldAPI
Owner

No description provided.

claude added 3 commits April 3, 2026 16:41
…ection

BF16↔f32 batch conversion via stable Rust 1.94:
  1. Runtime detect avx512bf16 + avx512vl
  2. as_chunks::<16>() → _mm512_cvtpbh_ps (16 BF16 → 16 f32)
  3. as_chunks::<8>() remainder → _mm256_cvtpbh_ps (8 BF16 → 8 f32)
  4. Scalar tail → f32::from_bits((bits as u32) << 16)

No LazyLock — slice chunking handles batch widths.
No nightly — as_chunks is stable since 1.94.

Reference: https://doc.rust-lang.org/beta/src/core/stdarch/crates/core_arch/src/x86/avx512bf16.rs.html

Types: BF16x16 (__m256bh), BF16x8 (__m128bh) — available when
target_feature avx512bf16 is enabled at compile time.

Functions (always available, scalar fallback built in):
  bf16_to_f32_batch(input: &[u16], output: &mut [f32])
  f32_to_bf16_batch(input: &[f32], output: &mut [u16])
  bf16_to_f32_scalar(bits: u16) → f32
  f32_to_bf16_scalar(v: f32) → u16
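
The scalar fallback path can be sketched as follows. This is a minimal illustration, not the code from this PR: the function names and signatures follow the list above, but the round-to-nearest-even rounding in f32_to_bf16_scalar, the NaN handling, and the panic on mismatched slice lengths are assumptions, since the commit message does not state them. The SIMD tiers are omitted here.

```rust
/// Widen one BF16 value (stored as raw bits) to f32.
/// BF16 is the upper 16 bits of an IEEE-754 f32, so a shift is enough.
pub fn bf16_to_f32_scalar(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Narrow one f32 to BF16 bits.
/// ASSUMPTION: round-to-nearest-even; the PR does not state its rounding mode.
pub fn f32_to_bf16_scalar(v: f32) -> u16 {
    let bits = v.to_bits();
    if v.is_nan() {
        // Keep the sign/exponent and force a quiet-NaN mantissa bit so the
        // result cannot collapse to infinity after truncation.
        return ((bits >> 16) as u16) | 0x0040;
    }
    // Add 0x7FFF plus the lowest kept bit: ties round to the even neighbour.
    let lsb = (bits >> 16) & 1;
    let rounded = bits.wrapping_add(0x7FFF + lsb);
    (rounded >> 16) as u16
}

/// Scalar-only batch conversion (the AVX-512 tiers are not shown).
/// ASSUMPTION: panics if the slices differ in length.
pub fn bf16_to_f32_batch(input: &[u16], output: &mut [f32]) {
    assert_eq!(input.len(), output.len(), "input/output length mismatch");
    for (dst, &src) in output.iter_mut().zip(input) {
        *dst = bf16_to_f32_scalar(src);
    }
}

/// Scalar-only batch conversion in the other direction.
pub fn f32_to_bf16_batch(input: &[f32], output: &mut [u16]) {
    assert_eq!(input.len(), output.len(), "input/output length mismatch");
    for (dst, &src) in output.iter_mut().zip(input) {
        *dst = f32_to_bf16_scalar(src);
    }
}
```

In a tiered version, the wide chunks would go through the intrinsics named above and only the leftover elements would reach these scalar helpers.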

3 tests passing.

https://claude.ai/code/session_01ChLvBfpJS8dQhHxRD4pYNp
…y_windows dispatch

Compile-time const (not LazyLock) — resolved by #[cfg(target_feature)]:
  AVX-512: F64=8, F32=16, U64=8, I16=32
  AVX2:    F64=4, F32=8,  U64=4, I16=16
  Scalar:  same as AVX2

Enables consumers to use array_windows::<{PREFERRED_F64_LANES}>()
for native-width SIMD processing without runtime branching.
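
A rough sketch of how such a compile-time lane constant might be declared and consumed. PREFERRED_F64_LANES is the name used above; the exact cfg predicate ("avx512f") and the window_sums helper are assumptions for illustration. The sketch uses slice::windows rather than array_windows so that it also builds on older compilers.

```rust
// ASSUMPTION: the AVX-512 tier is selected by the "avx512f" target feature.
#[cfg(target_feature = "avx512f")]
pub const PREFERRED_F64_LANES: usize = 8;

// AVX2 and the scalar tier share the same width, per the table above.
#[cfg(not(target_feature = "avx512f"))]
pub const PREFERRED_F64_LANES: usize = 4;

/// Hypothetical consumer: sum every sliding window of native width.
/// The window size is fixed at compile time, so there is no runtime branch.
pub fn window_sums(data: &[f64]) -> Vec<f64> {
    data.windows(PREFERRED_F64_LANES)
        .map(|w| w.iter().sum::<f64>())
        .collect()
}
```

Because the constant is resolved by cfg at build time, a binary compiled without AVX-512 enabled keeps the narrower width even when run on an AVX-512 machine.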

https://claude.ai/code/session_01ChLvBfpJS8dQhHxRD4pYNp
…kin/SD)

Added to all three tiers (AVX-512 / AVX2 / scalar):
  cmpeq_mask(other) → u64   — byte-wise equality, returns bitmask
  shr_epi16(imm) → Self     — shift right 16-bit lanes (nibble extract)
  saturating_sub(other)      — max(a-b, 0) per byte (delta subtraction)
  unpack_lo_epi8(other)      — interleave low bytes (nibble interleave)
  unpack_hi_epi8(other)      — interleave high bytes
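
For orientation, a scalar-tier version of these five operations could look like the sketch below. The method names come from the list above; everything else is an assumption, in particular that the struct wraps a [u8; 64], that 16-bit lanes are little-endian, and that the unpack operations interleave within each 128-bit block the way _mm512_unpacklo_epi8 and _mm512_unpackhi_epi8 do.

```rust
/// Hypothetical scalar stand-in for the crate's U8x64 type.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct U8x64(pub [u8; 64]);

impl U8x64 {
    pub fn splat(v: u8) -> Self {
        Self([v; 64])
    }

    /// Bit i of the result is set when byte i is equal in both vectors.
    pub fn cmpeq_mask(self, other: Self) -> u64 {
        let mut mask = 0u64;
        for i in 0..64 {
            if self.0[i] == other.0[i] {
                mask |= 1u64 << i;
            }
        }
        mask
    }

    /// Logical right shift of each little-endian 16-bit lane.
    pub fn shr_epi16(self, imm: u32) -> Self {
        let mut out = [0u8; 64];
        for i in 0..32 {
            let lane = u16::from_le_bytes([self.0[2 * i], self.0[2 * i + 1]]);
            // Shift counts of 16 or more clear the lane, as the intrinsic does.
            let shifted = if imm < 16 { lane >> imm } else { 0 };
            out[2 * i..2 * i + 2].copy_from_slice(&shifted.to_le_bytes());
        }
        Self(out)
    }

    /// Per-byte max(a - b, 0).
    pub fn saturating_sub(self, other: Self) -> Self {
        let mut out = [0u8; 64];
        for i in 0..64 {
            out[i] = self.0[i].saturating_sub(other.0[i]);
        }
        Self(out)
    }

    /// Interleave 8 bytes from each operand inside every 16-byte block.
    fn unpack(self, other: Self, offset: usize) -> Self {
        let mut out = [0u8; 64];
        for block in 0..4 {
            let base = block * 16;
            for j in 0..8 {
                out[base + 2 * j] = self.0[base + offset + j];
                out[base + 2 * j + 1] = other.0[base + offset + j];
            }
        }
        Self(out)
    }

    pub fn unpack_lo_epi8(self, other: Self) -> Self {
        self.unpack(other, 0)
    }

    pub fn unpack_hi_epi8(self, other: Self) -> Self {
        self.unpack(other, 8)
    }
}
```

Note that shr_epi16 shifts bits across the byte boundary inside each 16-bit lane, so a nibble extract would normally mask the result with 0x0F afterwards.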

These operations are used by:
  palette_codec.rs — Minecraft-style variable-width bit packing
  nibble.rs — 4-bit light level packing (Pumpkin)
  byte_scan.rs — NBT format byte scanning
  (future) stable_diffusion/ — VAE latent palette encoding via GGUF
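
As a concrete picture of the 4-bit light level packing mentioned for nibble.rs, here is a scalar sketch. The helper names pack_nibbles and unpack_nibbles are hypothetical, and the layout (even index in the low nibble, odd index in the high nibble) is an assumption based on the usual Minecraft convention, not something this PR states.

```rust
/// Pack 4-bit values two per byte.
/// ASSUMPTION: even indices go to the low nibble, odd indices to the high nibble.
pub fn pack_nibbles(levels: &[u8]) -> Vec<u8> {
    levels
        .chunks(2)
        .map(|pair| {
            let lo = pair[0] & 0x0F;
            // A trailing odd element is padded with a zero high nibble.
            let hi = pair.get(1).copied().unwrap_or(0) & 0x0F;
            lo | (hi << 4)
        })
        .collect()
}

/// Recover `count` 4-bit values from the packed bytes.
pub fn unpack_nibbles(packed: &[u8], count: usize) -> Vec<u8> {
    (0..count)
        .map(|i| {
            let byte = packed[i / 2];
            if i % 2 == 0 { byte & 0x0F } else { byte >> 4 }
        })
        .collect()
}
```

A SIMD version would handle many bytes at once: a 16-bit lane shift followed by a 0x0F mask isolates the high nibbles, and the unpack operations interleave low and high nibbles back into index order.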

The three existing modules (palette_codec.rs, nibble.rs, byte_scan.rs) currently use raw _mm256_/_mm512_ intrinsics.
Next step: rewire them to use crate::simd::U8x64 instead.

https://claude.ai/code/session_01ChLvBfpJS8dQhHxRD4pYNp
@AdaWorldAPI AdaWorldAPI merged commit 8ba065c into master Apr 3, 2026